{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "第8讲 Logistic回归与最大熵模型\n",
    "===\n",
    "主讲教师：高鹏\n",
    "---\n",
    "办公地点：网络空间安全学院407\n",
    "---\n",
    "联系方式：pgao@qfnu.edu.cn\n",
    "---\n",
    "面向专业：软件工程（智能数据）\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 兼容python2和python3\n",
    "from __future__ import print_function\n",
    "\n",
    "import numpy as np\n",
    "import scipy as sp\n",
    "import pandas as pd\n",
    "import sklearn\n",
    "import math\n",
    "import matplotlib.pyplot as plt\n",
    "from collections import Counter\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 引言\n",
    "\n",
    "Logistic回归是统计学习中的经典分类方法，最大熵是概率模型学习的一个准则，将其推广到分类问题得到最大熵模型（maximum entropy model）。Logistic回归模型与最大熵模型都属于对数线性模型。本讲首先介绍Logistic回归模型，然后介绍最大熵模型，最后讲述Logistic回归与最大熵模型的学习算法，包括改进的迭代尺度算法、拟牛顿法和梯度下降法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.1 Logistic回归模型\n",
    "\n",
    "## 8.1.2 Logistic分布\n",
    "\n",
    "首先介绍Logistic分布（logistic distribution）。\n",
    "\n",
    "**定义1（Logistic分布）**  设$X$是连续随机变量，$X$服从Logistic分布是指$X$具有下列分布函数和密度函数：\n",
    "\n",
    "$$\n",
    "F(X)=P(X\\leq x)=\\frac{1}{1+e^{-(x-\\mu)/\\gamma}}\n",
    "$$\n",
    "$$\n",
    "f(x)=F^\\prime (x)=\\frac{e^{-(x-\\mu)/\\gamma}}{\\gamma(1+e^{-(x-\\mu)/\\gamma)^2}}\n",
    "$$\n",
    "\n",
    "式中，$\\mu$为位置参数，$\\gamma>0$为形状参数。\n",
    "\n",
    "<p align=\"center\">\n",
    "  <img width=\"400\" src=\"Lesson8-1.jpg\">\n",
    "</p>\n",
    "\n",
    "Logistic分布的密度函数$f(x)$和分布函数$F(X)$的图形如上图所示。分布函数属于Logistic函数，其图形是一条S形曲线（sigmoid curve）。该曲线以点$(\\mu,\\frac{1}{2})$为中心对称，即满足\n",
    "\n",
    "$$\n",
    "F(-x+\\mu)-\\frac{1}{2}=-F(x-\\mu)+\\frac{1}{2}\n",
    "$$\n",
    "\n",
    "曲线在中心附近增长速度较快，在两端增长速度较慢。形状参数$\\gamma$的值越小，曲线在中心附近增长得越快。\n",
    "\n",
    "## 8.1.2 二项Logistic回归模型\n",
    "\n",
    "二项Logistic回归模型（binomial logistic regression model）是一种分类模型，由条件概率分布$P(Y|X)$表示，形式为参数化的Logistic分布。这里，随机变量$X$取值为实数，随机变量$Y$取值为1或0。我们通过监督学习的方法来估计模型参数。\n",
    "\n",
    "**定义2（Logistic回归模型）**  二项Logistic回归模型是如下的条件概率分布：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "P(Y=1 | x)&=\\frac{\\exp(w\\cdot x+b)}{1+\\exp(w\\cdot x+b)}\\\\\n",
    "P(Y=0|x)&=\\frac{1}{1+\\exp(w\\cdot x+b)}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "这里，$x\\in\\mathbb{R}^n$是输入，$Y\\in\\{0,1\\}$是输出，$w\\in\\mathbb{R}^n$和$b\\in\\mathbb{R}$是参数，$w$称为权值向量，$b$称为偏置，$w\\cdot x$为 $w$和$x$的内积。\n",
    "\n",
    "对于给定的输入实例$x$，可以求得$P(Y=1|x)$和$P(Y=0|x)$。Logistic回归比较两个条件概率值的大小，将实例$x$分到概率值较大的那一类。\n",
    "\n",
    "有时为了方便，将权值向量和输入向量加以扩充，仍记作$w$，$x$，即$w=(w^{(1)},w^{(2)},\\ldots,w^{(n)},b)^T$，$x=(x^{(1)},x^{(2)},\\ldots,x^{(n)},1)^T$。这时，Logistic回归模型如下：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "P(Y=1|x)&=\\frac{\\exp(w\\cdot x)}{1+\\exp(w\\cdot x)}\\\\\n",
    "P(Y=0|x)&=\\frac{1}{1+\\exp(w\\cdot x)}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "现在考查Logistic回归模型的特点。一个事件的几率（odds）是指该事件发生的概率与该事件不发生的概率的比值。如果事件发生的概率是p，那么该事件的几率是$\\frac{p}{1-p}$，该事件的对数几率（log odds）或logit函数是\n",
    "\n",
    "$$\n",
    "\\operatorname{logit}(p)=\\log\\frac{p}{1-p}\n",
    "$$\n",
    "\n",
    "对Logistic回归而言，有\n",
    "\n",
    "$$\n",
    "\\log\\frac{P(Y=1|x)}{1-P(Y=1|x)}=w\\cdot x\n",
    "$$\n",
    "\n",
    "这就是说，在Logistic回归模型中，输出$Y=1$的对数几率是输入x的线性函数。或者说，输出$Y=1$的对数几率是由输入$x$的线性函数表示的模型，即Logistic回归模型。\n",
    "\n",
    "换一个角度看，考虑对输入$x$进行分类的线性函数$w\\cdot x$，其值域为实数域。\n",
    "\n",
    "注意，这里$x\\in\\mathbb{R}^{n+1}$，$w\\in\\mathbb{R}^{n+1}$。通过Logistic回归模型的定义可以将线性函数$w\\cdot x$转换为概率\n",
    "\n",
    "$$\n",
    "P(Y=1|x)=\\frac{\\exp(w\\cdot x)}{1+\\exp(w\\cdot x)}\n",
    "$$\n",
    "\n",
    "这时，线性函数的值越接近正无穷，概率值就越接近1；线性函数的值越接近负无穷，概率值就越接近0。这样的模型就是Logistic回归模型。"
   ]
  },
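  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before building a classifier, here is a tiny numerical check (a sketch with made-up weights, added for illustration) that $\\log\\frac{P(Y=1|x)}{1-P(Y=1|x)}=w\\cdot x$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "w = np.array([0.5, -1.2, 2.0])  # hypothetical extended weight vector (last entry is the bias)\n",
    "x = np.array([1.0, 0.3, 1.0])   # extended input vector (last entry is 1)\n",
    "z = np.dot(w, x)\n",
    "p1 = np.exp(z) / (1 + np.exp(z))  # P(Y=1|x)\n",
    "print('w.x       =', z)\n",
    "print('logit(p1) =', np.log(p1 / (1 - p1)))  # equals w.x up to rounding"
   ]
  },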
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from math import exp\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# data\n",
    "def create_data():\n",
    "    iris = load_iris()\n",
    "    df = pd.DataFrame(iris.data, columns=iris.feature_names)\n",
    "    df['label'] = iris.target\n",
    "    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']\n",
    "    data = np.array(df.iloc[:100, [0,1,-1]])\n",
    "    # print(data)\n",
    "    return data[:,:2], data[:,-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X, y = create_data()\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class LogisticReressionClassifier:\n",
    "    def __init__(self, max_iter=200, learning_rate=0.01):\n",
    "        self.max_iter = max_iter\n",
    "        self.learning_rate = learning_rate\n",
    "\n",
    "    def sigmoid(self, x):\n",
    "        return 1 / (1 + exp(-x))\n",
    "\n",
    "    def data_matrix(self, X):\n",
    "        data_mat = []\n",
    "        for d in X:\n",
    "            data_mat.append([1.0, *d])\n",
    "        return data_mat\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        # label = np.mat(y)\n",
    "        data_mat = self.data_matrix(X)  # m*n\n",
    "        self.weights = np.zeros((len(data_mat[0]), 1), dtype=np.float32)\n",
    "\n",
    "        for iter_ in range(self.max_iter):\n",
    "            for i in range(len(X)):\n",
    "                result = self.sigmoid(np.dot(data_mat[i], self.weights))\n",
    "                error = y[i] - result\n",
    "                self.weights += self.learning_rate * error * np.transpose(\n",
    "                    [data_mat[i]])\n",
    "        print('LogisticRegression Model(learning_rate={},max_iter={})'.format(\n",
    "            self.learning_rate, self.max_iter))\n",
    "\n",
    "    # def f(self, x):\n",
    "    #     return -(self.weights[0] + self.weights[1] * x) / self.weights[2]\n",
    "\n",
    "    def score(self, X_test, y_test):\n",
    "        right = 0\n",
    "        X_test = self.data_matrix(X_test)\n",
    "        for x, y in zip(X_test, y_test):\n",
    "            result = np.dot(x, self.weights)\n",
    "            if (result > 0 and y == 1) or (result < 0 and y == 0):\n",
    "                right += 1\n",
    "        return right / len(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lr_clf = LogisticReressionClassifier()\n",
    "lr_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lr_clf.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "x_ponits = np.arange(4, 8)\n",
    "y_ = -(lr_clf.weights[1]*x_ponits + lr_clf.weights[0])/lr_clf.weights[2]\n",
    "plt.plot(x_ponits, y_)\n",
    "\n",
    "#lr_clf.show_graph()\n",
    "plt.scatter(X[:50,0],X[:50,1], label='0')\n",
    "plt.scatter(X[50:,0],X[50:,1], label='1')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8.1.3 模型参数估计\n",
    "\n",
    "Logistic回归模型学习时，对于给定的训练数据$T=\\{(x_1,y_1),(x_2,y_2),\\ldots,(x_N,y_N)\\}$，其中，$x_i\\in\\mathbb{R}^n$，$y_i\\in\\{0,1\\}$，可以应用极大似然估计法估计模型参数，从而得到Logistic回归模型.\n",
    "\n",
    "设\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "P(Y=1|x)&=\\pi(x)\\\\\n",
    "P(Y=0|x)&=1-\\pi(x)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "似然函数为\n",
    "\n",
    "$$\n",
    "\\prod_{i=1}^N[\\pi(x_i)]^{y_i}[1-\\pi(x_i)]^{1-y_i}\n",
    "$$\n",
    "\n",
    "对数似然函数为\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "L(w)&=\\sum_{i=1}^N[y_i\\log\\pi(x_i)+(1-y_i)\\log(1-\\pi(x_i))]\\\\\n",
    "&=\\sum_{i=1}^N[y_i\\log\\frac{\\pi(x_i)}{1-\\pi(x_i)}+\\log(1-\\pi(x_i))]\\\\\n",
    "&=\\sum_{i=1}^N[y_i(w\\cdot x_i)-\\log(1+\\exp(w\\cdot x_i))]\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "对$L(w)$求极大值，得到$w$的估计值.\n",
    "\n",
    "这样，问题就变成了以对数似然函数为目标函数的最优化问题. Logistic回归学习中通常采用的方法是梯度下降法及拟牛顿法.\n",
    "\n",
    "假设$w$的极大似然估计值是$\\hat{w}$，那么学到的Logistic回归模型为\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "P(Y=1|x)&=\\frac{\\exp(\\hat{w}\\cdot x)}{1+\\exp(\\hat{w}\\cdot x)}\\\\\n",
    "P(Y=0|x)&=\\frac{1}{1+\\exp(\\hat{w}\\cdot x)}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "## 8.1.4 多项Logistic回归\n",
    "\n",
    "上面介绍的Logistic回归模型是二项分类模型，用于二类分类.可以将其推广为多项Logistic回归模型（multi-nominal logistic regression model），用于多类分类. 假设离散型随机变量$Y$的取值集合是$\\{1,2,\\ldots,K\\}$，那么多项Logistic回归模型是\n",
    "\n",
    "$$\n",
    "P(Y=k|x)=\\frac{\\exp(w_k\\cdot x)}{1+\\sum_{k=1}^{K-1}\\exp(w_k\\cdot x)},\\;k=1,2,\\ldots,K-1\n",
    "$$\n",
    "\n",
    "$$\n",
    "P(Y=K|x)=\\frac{1}{1+\\sum_{k=1}^{K-1}\\exp(w_k\\cdot x)},\\;k=1,2,\\ldots,K-1\n",
    "$$\n",
    "\n",
    "这里，$x\\in\\mathbb{R}^{n+1}$，$w_k\\in\\mathbb{R}^{n+1}$.\n",
    "\n",
    "二项Logistic回归的参数估计法也可以推广到多项Logistic回归.\n",
    "\n",
    "# 8.2 最大熵模型\n",
    "\n",
    "最大熵模型（maximum entropy model）由最大熵原理推导实现，这里首先叙述一般的最大熵原理，然后进解最大熵模型的推导，最后给出最大熵模型学习的形式.\n",
    "\n",
    "## 8.2.1 最大熵原理\n",
    "\n",
    "最大熵原理是概率模型学习的一个准则. 最大熵原理认为，学习概率模型时，在所有可能的概率模型（分布）中，熵最大的模型是最好的模型. 通常用约束条件来确定概率模型的集合，所以，最大熵原理也可以表述为在满足约束条件的模型集合中选取熵最大的模型.\n",
    "\n",
    "假设离散随机变量X的概率分布是$P(X)$，则其熵是\n",
    "\n",
    "$$\n",
    "H(P)=-\\sum_xP(x)\\log P(x)\n",
    "$$\n",
    "\n",
    "熵满足下列不等式\n",
    "\n",
    "$$\n",
    "0\\leq H(P)\\leq \\log|X|\n",
    "$$\n",
    "\n",
    "式中，$|X|$是$X$的取值个数，当且仅当$X$的分布是均匀分布时右边的等号成立. 这就是说，当$X$服从均匀分布时，熵最大.\n",
    "\n",
    "直观地，最大熵原理认为要选择的概率模型首先必须满足已有的事实，即约束条件. 在没有更多信息的情况下，那些不确定的部分都是\"等可能的\". 最大熵原理通过熵的最大化来表示等可能性. \"等可能\"不容易操作，而熵则是一个可优化的数值指标.\n",
    "\n",
    "首先，通过一个简单的例子来介绍一下最大熵原理.\n",
    "\n",
    "---\n",
    "\n",
    "### 例8.1\n",
    "\n",
    "假设随机变量$X$有5个取值$\\{A,B,C,D,E\\}$，要估计取各个值的概率$P(A),P(B),P(C),P(D),P(E)$.\n",
    "\n",
    "**解**  这些概率值满足以下约束条件\n",
    "\n",
    "$$\n",
    "P(A)+P(B)+P(C)+P(D)+P(E)=1\n",
    "$$\n",
    "\n",
    "满足这个约束条件的概率分布有无穷多个. 如果没有任何其他信息，仍要对概率分布进行估计，一个办法就是认为这个分布中取各个值的概率是相等的\n",
    "\n",
    "$$\n",
    "P(A)=P(B)=P(C)=P(D)=P(E)=\\frac{1}{5}\n",
    "$$\n",
    "\n",
    "等概率表示了对事实的无知. 因为没有更多的信息，这种判断是合理的.\n",
    "\n",
    "有时，能从一些先验知识中得到一些对概率值的约束条件，例如\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&P(A)=P(B)=\\frac{3}{10}\\\\\n",
    "&P(A)+P(B)+P(C)+P(D)+P(E)=1\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "满足这两个约束条件的概率分布仍然有无穷多个，在缺少其他信息的情况下，可以认为$A$与$B$是等概率的，$C$，$D$与$E$是等概率的，于是\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&P(A)=P(B)=\\frac{3}{20}\\\\\n",
    "&P(C)=P(D)=P(E)=\\frac{7}{30}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "如果还有第3个约束条件\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&P(A)+P(C)=\\frac{1}{2}\\\\\n",
    "&P(A)+P(B)=\\frac{3}{10}\\\\\n",
    "&P(A)+P(B)+P(C)+P(D)+P(E)=1\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "可以继续按照满足约束条件下求等概率的方法估计概率分布，这里不再继续讨论. 以上概率模型学习的方法正是遵循了最大熵原理.\n",
    "\n",
    "<p align=\"center\">\n",
    "  <img width=\"400\" src=\"Lesson8-2.jpg\">\n",
    "</p>\n",
    "\n",
    "上图提供了用最大熵原理进行概率模型选择的几何解释．概率模型集合$\\mathcal{P}$可由欧氏空间中的单纯形（simplex）表示，如左图的三角形（2-单纯形）. —个点代表一个模型，整个单纯形代表模型集合.右图上的一条直线对应于一个约束条件，直线的交集对应于满足所有约束条件的模型集合. 一般地，这样的模型仍有无穷多个. 学习的目的是在可能的模型集合中选择最优模型，而最大熵原理则给出最优模型选择的一个准则.\n",
    "\n",
    "---\n",
    "\n",
    "## 8.2.2 最大熵模型的定义\n",
    "\n",
    "最大熵原理是统计学习的一般原理，将它应用到分类得到最大熵模型.\n",
    "\n",
    "假设分类模型是一个条件概率分布$P(Y|X)$，$X\\in\\mathcal{X}\\subseteq\\mathbb{R}^n$表示输入，$Y\\in\\mathcal{Y}$表示输出，$\\mathcal{X}$和$\\mathcal{Y}$分别是输入和输出的集合. 这个模型表示的是对于给定的输入$X$，以条件概率$P(Y|X)$输出$Y$.\n",
    "\n",
    "给定一个训练数据集\n",
    "\n",
    "$$\n",
    "T=\\{(x_1,y_1),(x_2,y_2),\\ldots,(x_N,y_N)\\}\n",
    "$$\n",
    "\n",
    "学习的目标是用最大熵原理选择最好的分类模型.\n",
    "\n",
    "首先考虑模型应该满足的条件. 给定训练数据集，可以确定联合分布$P(X,Y)$的经验分布和边缘分布$P(X)$的经验分布，分别以$\\tilde{P}(X,Y)$和$\\tilde{P}(X)$表示. 这里，\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&\\tilde{P}(X=x,Y=y)=\\frac{v(X-x,Y=y)}{N}\\\\\n",
    "&\\tilde{P}(X=x)=\\frac{v(X-x)}{N}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "其中，$v(X-x,Y=y)$表示训练数据中样本$(x,y)$出现的频数，$v(X-x)$表示训练数据中输入$x$出现的频数，$N$表示训练样本容量.\n",
    "\n",
    "用特征函数（feature function）$f(x,y)$描述输入$x$和输出$y$之间的某一个事实. 其定义是\n",
    "\n",
    "$$\n",
    "f(x,y)=\n",
    "\\left\\{\\begin{array}{ll}\n",
    "1, & x与y满足某一事实\\\\\n",
    "0, & 否则\n",
    "\\end{array}\\right.\\\\\n",
    "$$\n",
    "\n",
    "它是一个二值函数，当$x$和$y$满足这个事实时取值为1，否则取值为 0.\n",
    "\n",
    "特征函数$f(x,y)$关于经验分布$\\tilde{P}(X,Y)$的期望值，用$E_{\\tilde{P}}(f)$表示.\n",
    "\n",
    "$$\n",
    "E_{\\tilde{P}}(f)=\\sum_{x,y}\\tilde{P}(x,y)f(x,y)\n",
    "$$\n",
    "\n",
    "特征函数$f(x,y)$关于模型$P(Y|X)$与经验分布$\\tilde{P}(X)$的期望值，用$E_{P}(f)$表示.\n",
    "\n",
    "$$\n",
    "E_{P}(f)=\\sum_{x,y}\\tilde{P}(x)P(x,y)f(x,y)\n",
    "$$\n",
    "\n",
    "如果模型能够获取训练数据中的信息，那么就可以假设这两个期望值相等，即\n",
    "\n",
    "$$\n",
    "E_{P}(f)=E_{\\tilde{P}}(f)\n",
    "$$\n",
    "\n",
    "或\n",
    "\n",
    "$$\n",
    "\\sum_{x,y}\\tilde{P}(x)P(x,y)f(x,y)=\\sum_{x,y}\\tilde{P}(x,y)f(x,y)\n",
    "$$\n",
    "\n",
    "我们将以上两式作为模型学习的约束条件. 假如有$n$个特征函数$f_i(x,y)$，$i=1,2,\\ldots,n$，那么就有$n$个约束条件.\n",
    "\n",
    "**定义8.3（最大熵模型）**  假设满足所有约束条件的模型集合为\n",
    "\n",
    "$$\n",
    "\\mathcal{C}\\equiv\\{P\\in\\mathcal{P}|E_p(f_i)=E_{\\tilde{P}}(f_i),\\;i=1,2,\\ldots,n\\}\n",
    "$$\n",
    "\n",
    "定义在条件概率分布$P(Y|X)$上的条件熵为\n",
    "\n",
    "$$\n",
    "H(P)=-\\sum_{x,y}\\tilde{P}(x)P(x,y)f(x,y)\n",
    "$$\n",
    "\n",
    "则模型集合$\\mathcal{C}$中条件熵$H(P)$最大的模型称为最大熵模型．式中的对数为自然对数.\n",
    "\n",
    "## 8.2.3 最大熵模型的学习\n",
    "\n",
    "最大熵模型的学习过程就是求解最大熵模型的过程，最大熵模型的学习可以形式化为约束最优化问题.\n",
    "\n",
    "对于给定的训练数据集$T=\\{(x_1,y_1),(x_2,y_2),\\ldots,(x_N,y_N)\\}$以及特征函数$f_i(x,y)$，$i=1,2,\\ldots,n$，最大熵模型的学习等价于约束最优化问题\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\max_{P\\in\\mathcal{C}}\\quad&H(P)=-\\sum_{x,y}\\tilde{P}(x)P(y|x)f(y|x)\\\\\n",
    "\\operatorname{s.t.}\\quad&E_P(f_i)=E_{\\tilde{P}}(f_i),\\;i=1,2,\\ldots,n\\\\\n",
    "&\\sum_yP(y|x)=1\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "按照最优化问题的习惯，将求最大值问题改写为等价的求最小值问题\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\min_{P\\in\\mathcal{C}}\\quad&H(P)=-\\sum_{x,y}\\tilde{P}(x)P(y|x)f(y|x)\\\\\n",
    "\\operatorname{s.t.}\\quad&E_P(f_i)-E_{\\tilde{P}}(f_i)=0,\\;i=1,2,\\ldots,n\\\\\n",
    "&\\sum_yP(y|x)=1\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "求解约束最优化问题，所得出的解就是最大熵模型学习的解. 下面给出具体推导.\n",
    "\n",
    "这里，将约束最优化的原始问题转换为无约束最优化的对偶问题．通过求解对偶问题求解原始问题.\n",
    "\n",
    "首先，引进拉格朗日乘子$w_0,w_1,w_2,\\ldots,w_n$，定义拉格朗日函数$L(P,w)$\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "L(P,w)\\equiv&-H(P)+w_0(1-\\sum_yP(y|x))+\\sum_{i=1}^nw_i(E_{\\tilde{P}}(f_i)-E_p(f_i))\\\\\n",
    "=&\\sum_{x,y}\\tilde{P}(x)P(y|x)f(y|x)+w_0(1-\\sum_yP(y|x))\\\\\n",
    "&+\\sum_{i=1}^nw_i(\\sum_{x,y}\\tilde{P}(x,y)f_i(x,y)-\\sum_{x,y}\\tilde{P}(x)P(y|x)f_i(y|x))\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "最优化的原始问题是\n",
    "\n",
    "$$\n",
    "\\min_{P\\in\\mathcal{C}}\\;\\max_{w}L(P,w)\n",
    "$$\n",
    "\n",
    "对偶问题是\n",
    "\n",
    "$$\n",
    "\\max_{w}\\;\\min_{P\\in\\mathcal{C}}L(P,w)\n",
    "$$\n",
    "\n",
    "由于拉格朗日函数$L(P,w)$是$P$的凸函数，原始问题的解与对偶问题的解是等价的. 这样，可以通过求解对偶问题来求解原始问题.\n",
    "\n",
    "首先，求解对偶问题内部的极小化问题$\\min_{P\\in\\mathcal{C}}L(P,w)$. $\\max_{w}L(P,w)$是$w$的函数，将其记作\n",
    "\n",
    "$$\n",
    "\\Psi(w)=\\min_{P\\in\\mathcal{C}}L(P,w)=L(P_w,w)\n",
    "$$\n",
    "\n",
    "$\\Psi(w)$称为对偶函数. 同时，将其解记作\n",
    "\n",
    "$$\n",
    "p_w=\\arg\\min_{P\\in\\mathcal{C}}L(P,w)=P_w(y|x)\n",
    "$$\n",
    "\n",
    "具体地，求$L(P,w)$对$P(y|x)$的偏导数\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\frac{\\partial L(P,w)}{\\partial P(y|x)}&=\\sum_{x,y}\\tilde{P}(x)(\\log P(y|x)+1)-\\sum_yw_0-\\sum_{x,y}(\\tilde{P}(x)\\sum_{i=1}^nw_if_i(x,y))\\\\\n",
    "&=\\sum_{x,y}\\tilde{P}(x)(\\log P(y|x)+1-w_0-\\sum_{i=1}^nw_if_i(x,y))\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "令偏导数等于0，在$\\tilde{P}(x)>0$的情况下，解得\n",
    "\n",
    "$$\n",
    "P(y|x)=\\exp(\\sum_{i=1}^nw_if_i(x,y)+w_0-1)=\\frac{\\exp(\\sum_{i=1}^nw_if_i(x,y))}{\\exp(1-w_0)}\n",
    "$$\n",
    "\n",
    "由于$\\displaystyle \\sum_yP(y|x)=1$，得\n",
    "\n",
    "$$\n",
    "P_w(y|x)=\\frac{1}{Z_w(x)}\\exp(\\sum^n_{i=1}w_if_i(x,y))\n",
    "$$\n",
    "\n",
    "其中，\n",
    "\n",
    "$$\n",
    "Z_w(x)=sum_y\\exp(\\sum^n_{i=1}w_if_i(x,y))\n",
    "$$\n",
    "\n",
    "$Z_w(x)$称为规范化因子; $f_i(x,y)$是特征函数; $w_i$是特征的权值.$P_w=P_w(y|x)$就是最大熵模型. 这里，$w$是最大熵模型中的参数向量.\n",
    "\n",
    "之后，求解对偶问题外部的极大化问题\n",
    "\n",
    "$$\n",
    "\\max_w\\Psi(w)\n",
    "$$\n",
    "\n",
    "将其解记为$w^*$，即\n",
    "\n",
    "$$\n",
    "w^*=\\arg\\max_w\\Psi(w)\n",
    "$$\n",
    "\n",
    "这就是说，可以应用最优化算法求对偶函数$\\Psi(w)$的极大化，得到$w^*$，用来表示$P^*\\in\\mathcal(C)$．这里，$P^*=P_{w^*}=P_{w^*}(y|x)$是学习到的最优模型（最大熵模型）. 也就是说，最大熵模型的学习归结为对偶函数$\\Psi(w)$的极大化.\n",
    "\n",
    "---\n",
    "### 例8.2\n",
    "\n",
    "学习例8.1中的最大熵模型.\n",
    "\n",
    "**解**  为了方便，分别以$y_1,y_2,y_3,y_4,y_5$，表示$A,B,C,D,E$，于是最大熵模型学习的最优化问题是\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\min\\quad&-H(P)=-\\sum_{i=1}^5P(y_i)\\log P(y_i)\\\\\n",
    "\\operatorname{s.t.}\\quad&P(y_1)+P(y_2)=\\tilde{P}(y_1)+\\tilde{P}(y_2)=\\frac{3}{10}\\\\\n",
    "&\\sum_{i=1}^5P(y_i)=\\sum_{i=1}^5\\tilde{P}(y_i)=1\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "引进拉格朗日乘子$w_0,w_1$，定义拉格朗日函数\n",
    "\n",
    "$$\n",
    "L(P,w)=\\sum_{i=1}^5P(y_i)\\log P(y_i)+w_1(P(y_1)+P(y_2)-\\frac{3}{10})+w_0(\\sum_{i=1}^5P(y_i)-1)\n",
    "$$\n",
    "\n",
    "根据拉格朗目对偶性，可以通过求解对偶最优化问题得到原始最优化问题的解，所以求解\n",
    "\n",
    "$$\n",
    "\\max_w\\min_PL(P,w)\n",
    "$$\n",
    "\n",
    "首先求解$L(P,w)$关于$P$的极小化问题. 为此，固定$w_0,w_1$，求偏导数\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&\\frac{\\partial L(P,w)}{\\partial P(y_1)}=1+\\log P(y_1)+w_1+w_0\\\\\n",
    "&\\frac{\\partial L(P,w)}{\\partial P(y_2)}=1+\\log P(y_2)+w_1+w_0\\\\\n",
    "&\\frac{\\partial L(P,w)}{\\partial P(y_3)}=1+\\log P(y_3)+w_0\\\\\n",
    "&\\frac{\\partial L(P,w)}{\\partial P(y_4)}=1+\\log P(y_4)+w_0\\\\\n",
    "&\\frac{\\partial L(P,w)}{\\partial P(y_5)}=1+\\log P(y_5)+w_0\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "令各偏导数等于0，解得\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&P(y_1)=P(y_2)=\\exp(-w_1-w_0-1)\\\\\n",
    "&P(y_3)=P(y_4)=P(y_5)=\\exp(-w_0-1)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "于是，\n",
    "\n",
    "$$\n",
    "\\min_pL(P,w)=L(P_w,w)=-2e^{-w_1-w_0-1}-3e^{-w_0-1}-\\frac{3}{10}w_1-w_0\n",
    "$$\n",
    "\n",
    "再求解$L(P_w,w)$关于$w$的极大化问题\n",
    "\n",
    "$$\n",
    "\\max_wL(P_w,w)=-2e^{-w_1-w_0-1}-3e^{-w_0-1}-\\frac{3}{10}w_1-w_0\n",
    "$$\n",
    "\n",
    "分别求$L(P_w,w)$对$w_0,w_1$的偏导数并令其为0，得到\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&e^{-w_1-w_0-1}=\\frac{3}{20}\\\\\n",
    "&e^{-w_0-1}=\\frac{7}{30}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "于是得到所要求的概率分布为\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "&P(y_1)=P(y_2)=\\frac{3}{20}\\\\\n",
    "&P(y_3)=P(y_4)=P(y_5)=\\frac{7}{30}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "---\n",
    "\n",
    "## 8.2.4 极大似然估计\n",
    "\n",
    "从以上最大熵模型学习中可以看出，最大熵模型是条件概率分布. 下面证明对偶函数的极大化等价于最大熵模型的极大似然估计.\n",
    "\n",
    "已知训练数据的经验概率分布$\\tilde{P}(X,Y)$，条件概率分布$P(Y|X)$的对数似然函数表示为\n",
    "\n",
    "$$\n",
    "L_{\\tilde{P}}(P_w)=\\log\\prod_{x,y}P(y|x)^{\\tilde{P}(x,y)}=\\sum_{x,y}\\tilde{P}(x,y)\\log P(y|x)\n",
    "$$\n",
    "\n",
    "当条件概率分布$P(y|x)$是最大熵模型时，对数似然函数$L_{\\tilde{P}}(P_w)$为\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "L_{\\tilde{P}}(P_w)&=\\sum_{x,y}\\tilde{P}(x,y)\\log P(y|x)\\\\\n",
    "&=\\sum_{x,y}\\tilde{P}(x,y)\\sum^n_{i=1}w_if_i(x,y)-\\sum_{x,y}\\tilde{P}(x,y)\\log Z_w(x)\\\\\n",
    "&=\\sum_{x,y}\\tilde{P}(x,y)\\sum^n_{i=1}w_if_i(x,y)-\\sum_{x}\\tilde{P}(x)\\log Z_w(x)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "再看对偶函数$\\Psi(w)$．可得\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\Psi(w)=&\\sum_{x,y}\\tilde{P}(x)P_w(y|x)\\log P_w(y|x)+\\sum_{i=1}^nw_i(\\sum_{x,y}\\tilde{P}(x,y)f_i(x,y)-\\sum_{x,y}\\tilde{P}(x)\\tilde{P}_w(y|x)f_i(x,y))\\\\\n",
    "=&\\sum_{x,y}\\tilde{P}(x,y)\\sum^n_{i=1}w_if_i(x,y)-\\sum_{x}\\tilde{P}(x)\\log Z_w(x)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "最后一步用到$\\sum_y P(y|x)=1$.\n",
    "\n",
    "$$\n",
    "\\Psi(w)=L_{\\tilde{P}}(P_w)\n",
    "$$\n",
    "\n",
    "既然对偶函数平$\\Psi(w)$等价于对数似然函数$L_{\\tilde{P}}(P_w)$，于是证明了最大熵模型学习中的对偶函数极大化等价于最大熵模型的极大似然估计这一事实.\n",
    "\n",
    "这样，最大熵模型的学习问题就转换为具体求解对数似然函数极大化或对偶函数极大化的问题.\n",
    "\n",
    "可以将最大熵模型写成更一般的形式.\n",
    "\n",
    "$$\n",
    "P_w(y|x)=\\frac{1}{Z_w(x)}\\exp(\\sum_{i=1}^nw_if_i(x,y))\n",
    "$$\n",
    "\n",
    "其中\n",
    "\n",
    "$$\n",
    "Z_w(x)=\\sum_y\\exp(\\sum_{i=1}^nw_if_i(x,y))\n",
    "$$\n",
    "\n",
    "这里，$x\\in\\mathbb{R}^n$为输入，$y\\in\\{1,2,\\ldots,K\\}$为输出，$w\\in\\mathbb{R}^n$为权值向量，$f_i(x,y),i=1,2,\\ldots,n$为任意实值特征函数.\n",
    "\n",
    "最大熵模型与Logistic回归模型有类似的形式，它们又称为对数线性模型（log linear model）.模型学习就是在给定的训练数据条件下对模型进行极大似然估计或正则化的极大似然估计.\n",
    "\n",
    "# 8.3 模型学习的最优化算法\n",
    "\n",
    "Logistic回归模型、最大熵模型学习归结为以似然函数为目标函数的最优化问题，通常通过迭代算法求解. 从最优化的观点看，这时的目标函数具有很好的性质. 它是光滑的凸函数，因此多种最优化的方法都适用，保证能找到全局最优解. 常用的方法有改进的迭代尺度法、梯度下降法、牛顿法或拟牛顿法. 牛顿法或拟牛顿法一般收敛速度更快.\n",
    "\n",
    "## 8.3.1 改进的迭代尺度法\n",
    "\n",
    "改进的迭代尺度法（improved iterative scaling，IIS）是一种最大熵模型学习的最优化算法.\n",
    "\n",
    "已知最大熵模型为\n",
    "\n",
    "$$\n",
    "P_w(y|x)=\\frac{1}{Z_w(x)}\\exp(\\sum_{i=1}^nw_if_i(x,y))\n",
    "$$\n",
    "\n",
    "其中\n",
    "\n",
    "$$\n",
    "Z_w(x)=\\sum_y\\exp(\\sum_{i=1}^nw_if_i(x,y))\n",
    "$$\n",
    "\n",
    "对数似然函数为\n",
    "\n",
    "$$\n",
    "L(w)=\\sum_{x,y}\\tilde{P}(x,y)\\sum_{i=1}^nw_if_i(x,y)-\\sum_x\\tilde{P}(x)\\log Z_w(x)\n",
    "$$\n",
    "\n",
    "目标是通过极大似然估计学习模型参数，即求对数似然函数的极大值w.\n",
    "\n",
    "IIS的想法是: 假设最大熵模型当前的参数向量是$w=(w_1,w_2,\\ldots,w_n)^T$，我们希望找到一个新的参数向量$w+\\delta=(w_1+\\delta_1,w_2+\\delta_2,\\ldots,w_n+\\delta_n)^T$，使得模型的对数似然函数值增大．如果能有这样一种参数向量更新的方法$\\tau:w\\rightarrow w+\\delta$，那么就可以重复使用这一方法，直至找到对数似然函数的最大值.\n",
    "\n",
    "对于给定的经验分布$\\tilde{P}(x,y)$，模型参数从$w$到$w+\\delta$，对数似然函数的改变量是\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "L(w+\\delta)-L(w)&=\\sum_{x,y}\\tilde{P}(x,y)\\log P_{w+\\delta}(y|x)-\\sum_{x,y}\\tilde{P}(x,y)\\log P_w(y|x)\n",
    "&=\\sum_{x,y}\\tilde{P}(x,y)\\sum_{i=1}^n\\delta_if_i(x,y)-\\sum_{x}\\tilde{P}(x)\\log\\frac{Z_{w+\\delta}(x)}{Z_w(x)}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "利用不等式\n",
    "\n",
    "$$\n",
    "-\\log\\alpha \\geq 1-\\alpha,\\quad \\alpha>0\n",
    "$$\n",
    "\n",
    "建立对数似然函数改变量的下界\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "L(w+\\delta)-L(w)&\\geq\\sum_{x,y}\\tilde{P}(x,y))\\sum_{i=1}^n\\delta_if_i(x,y)+1-\\sum_{x}\\tilde{P}(x)\\log\\frac{Z_{w+\\delta}(x)}{Z_w(x)}\n",
    "&=\\sum_{x,y}\\tilde{P}(x,y))\\sum_{i=1}^n\\delta_if_i(x,y)+1-\\sum_{x}\\tilde{P}(x)\\sum_yP_w(y|x)\\exp\\sum^n_{i=1}\\delta_if_i(x,y)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "将右端记为\n",
    "\n",
    "$$\n",
    "A(\\delta|w)=\\sum_{x,y}\\tilde{P}(x,y))\\sum_{i=1}^n\\delta_if_i(x,y)+1-\\sum_{x}\\tilde{P}(x)\\sum_yP_w(y|x)\\exp\\sum^n_{i=1}\\delta_if_i(x,y)\n",
    "$$\n",
    "\n",
    "于是有\n",
    "\n",
    "$$\n",
    "L(w+\\delta)-L(w)\\geq A(\\delta|w)\n",
    "$$\n",
    "\n",
    "即$A(\\delta|w)$是对数似然函数改变量的一个下界.\n",
    "\n",
    "如果能找到适当的$\\delta$使下界$A(\\delta|w)$提高，那么对数似然函数也会提高. 然而，函数$A(\\delta|w)$中的$\\delta$是一个向量，含有多个变量，不易同时优化. IIS试图一次只优化其中一个变量$\\delta_i$，而固定其他变量$\\delta_j,i\\neq j$.\n",
    "\n",
    "为达到这一目的，IIS进一步降低下界$A(\\delta|w)$. 具体地，IIS引进一个量$f^\\#(x,y)$\n",
    "\n",
    "$$\n",
    "f^\\#(x,y)=\\sum_if_i(x,y)\n",
    "$$\n",
    "\n",
    "因为$f_i$是二值函数，故$f^\\#(x,y)$表示所有特征在$(x,y)$出现的次数. 这样，$A(\\delta|w)$\n",
    "\n",
    "可以改写为\n",
    "\n",
    "$$\n",
    "A(\\delta|w)=\\sum_{x,y}\\tilde{P}(x,y))\\sum_{i=1}^n\\delta_if_i(x,y)+1-\\sum_{x}\\tilde{P}(x)\\sum_yP_w(y|x)\\exp(f^\\#(x,y)\\sum_{i=1}^n\\frac{\\delta_i  f_i(x,y)}{f^\\#(x,y)})\n",
    "$$\n",
    "\n",
    "利用指数函数的凸性以及对任意i，有$\\frac{f_i(x,y)}{f^\\#(x,y)}\\geq 0$这一事实，根据Jensen不等式，得到\n",
    "\n",
    "$$\n",
    "\\exp(\\sum_{i=1}^n\\frac{f_i(x,y)}{f^\\#(x,y)}\\delta_i f^\\#(x,y))\\leq \\sum_{i=1}^n\\frac{f_i(x,y)}{f^\\#(x,y)}\\exp(\\delta_i f^\\#(x,y))\n",
    "$$\n",
    "\n",
    "于是\n",
    "\n",
    "$$\n",
    "A(\\delta|w)\\geq\\sum_{x,y}\\tilde{P}(x,y))\\sum_{i=1}^n\\delta_if_i(x,y)+1-\\sum_{x}\\tilde{P}(x)\\sum_yP_w(y|x)\\sum_{i=1}^n\\frac{f_i(x,y)}{f^\\#(x,y)}\\exp(\\delta_i f^\\#(x,y))\n",
    "$$\n",
    "\n",
    "记不等式右端为\n",
    "\n",
    "$$\n",
    "B(\\delta|w)=\\sum_{x,y}\\tilde{P}(x,y))\\sum_{i=1}^n\\delta_if_i(x,y)+1-\\sum_{x}\\tilde{P}(x)\\sum_yP_w(y|x)\\sum_{i=1}^n\\frac{f_i(x,y)}{f^\\#(x,y)}\\exp(\\delta_i f^\\#(x,y))\n",
    "$$\n",
    "\n",
    "于是得到\n",
    "\n",
    "$$\n",
    "L(w+\\delta)-L(w)\\geq B(\\delta|w)\n",
    "$$\n",
    "\n",
    "这里，$B(\\delta|w)$是对数似然函数改变量的一个新的（相对不紧的）下界.\n",
    "\n",
    "求$B(\\delta|w)$对$\\delta_i$的偏导数\n",
    "\n",
    "$$\n",
    "\\frac{\\partial B(\\delta|w)}{\\partial\\delta_i}=\\sum_{x,y}\\tilde{P}(x,y))f_i(x,y)+1-\\sum_{x}\\tilde{P}(x)\\sum_yP_w(y|x)f_i(x,y)\\exp(\\delta_i f^\\#(x,y))\n",
    "$$\n",
    "\n",
    "除$\\delta_i$外不含任何其他变量．令偏导数为0得到\n",
    "\n",
    "$$\n",
    "\\sum_{x}\\tilde{P}(x)P_w(y|x)f_i(x,y)\\exp(\\delta_i f^\\#(x,y))=E_{\\tilde{P}}(f_i)\n",
    "$$\n",
    "\n",
    "于是，依次对$\\delta_i$求解可以求出$\\delta$.\n",
    "\n",
    "这就给出了一种求$w$的最优解的迭代算法，即改进的迭代尺度算法IIS.\n",
    "\n",
    "---\n",
    "\n",
    "**算法8.1（改进的迭代尺度算法IIS）**\n",
    "\n",
    "输入: 特征函数$f_1,f_2,\\ldots,f_n$; 经验分布$\\tilde{P}(X,Y)$，模型$P_w(y|x)$\n",
    "\n",
    "输出: 最优参数值$w_i^*$; 最优模型$P_{w^*}$\n",
    "\n",
    "（1）对所有$i\\in\\{1,2,\\ldots,n\\}$，取初值$w_i=0$\n",
    "\n",
    "（2）对每一$i\\in\\{1,2,\\ldots,n\\}$:\n",
    "\n",
    "- （a）令$\\delta_i$是方程\n",
    "\n",
    "$$\n",
    "\\sum_{x}\\tilde{P}(x)P_w(y|x)f_i(x,y)\\exp(\\delta_i f^\\#(x,y))=E_{\\tilde{P}}(f_i)\n",
    "$$\n",
    "\n",
    "的解，这里\n",
    "\n",
    "$$\n",
    "f^\\#(x,y)=\\sum_{i=1}^nf_i(x,y)\n",
    "$$\n",
    "\n",
    "- （b）更新$w_i$值: $w_i\\leftarrow w_i+\\delta_i$\n",
    "\n",
    "（3）如果不是所有$w_i$，都收敛，重复步（2）.\n",
    "\n",
    "---\n",
    "\n",
    "这一算法关键的一步是（a），即求解$\\delta_i$. 如果$f^\\#(x,y)$是常数，即对任何$x$，$y$，有$f^\\#(x,y)=M$，那么$\\delta_i$可以显式地表示成\n",
    "\n",
    "$$\n",
    "\\delta_i=\\frac{1}{M}\\log \\frac{E_{\\tilde{P}}(f_i)}{E_P(f_i)}\n",
    "$$\n",
    "\n",
    "如果$f^\\#(x,y)$不是常数，那么必须通过数值计算求$\\delta_i$. 简单有效的方法是牛顿法．牛顿法通过迭代求得$\\delta_i^*$，使得$g(\\delta_i^*)=0$．迭代公式是\n",
    "\n",
    "$$\n",
    "\\delta_i^{(k+1)}=\\delta_i^{(k)}-\\frac{g(\\delta_i^{(k)})}{g^\\prime(\\delta_i^{(k)})}\n",
    "$$\n",
    "\n",
    "只要适当选取初始值$\\delta_i^{(0)}$，由于$\\delta_i$有单根，因此牛顿法恒收敛，而且收敛速度很快.\n",
    "\n",
    "## 8.3.2 拟牛顿法\n",
    "\n",
    "最大熵模型学习还可以应用牛顿法或拟牛顿法.\n",
    "\n",
    "对于最大熵模型而言\n",
    "\n",
    "$$\n",
    "P_w(y|x)=\\frac{\\exp(\\sum_{i=1}^nw_if_i(x,y))}{\\sum_y\\exp(\\sum^n_{i=1}w_if_i(x,y))}\n",
    "$$\n",
    "\n",
    "目标函数\n",
    "\n",
    "$$\n",
    "\\min_{w\\in\\mathbb{R}^n}f(w)=\\sum_x\\tilde{P}(x)\\log\\sum_y\\exp(\\sum^n_{i=1}w_if_i(x,y))-\\sum_{x,y}\\tilde{P}(x,y)\\sum^n_{i=1}w_if_i(x,y)\n",
    "$$\n",
    "\n",
    "梯度\n",
    "\n",
    "$$\n",
    "g(w)=(\\frac{\\partial f(w)}{\\partial w_1},\\frac{\\partial f(w)}{\\partial w_2},\\cdots,\\frac{\\partial f(w)}{\\partial w_n})^T\n",
    "$$\n",
    "\n",
    "其中\n",
    "\n",
    "$$\n",
    "\\frac{\\partial f(w)}{\\partial w_i}=\\sum_{x,y}\\tilde{P}(x)P_w(y|x)f_i(x,y)-E_{\\tilde{P}}(f_i),\\;,i=1,2,\\ldots,n\n",
    "$$\n",
    "\n",
    "相应的拟牛顿法BFGS算法如下.\n",
    "\n",
    "---\n",
    "\n",
    "**算法8.2（最大熵模型学习的 BFGS 算法）**\n",
    "\n",
    "输入: 特征函数$f_1,f_2,\\ldots,f_n$; 经验分布$\\tilde{P}(x,y)$，目标函数$f(w)$，梯度$g(w)=\\nabla f(w)$精度要求$\\varepsilon$\n",
    "\n",
    "输出: 最优参数值$w_*$; 最优模型$P_{w^*}(y|x)$\n",
    "\n",
    "（1）选定初始点$w^{(0)}$，取$B_0$为正定对称矩阵，置$k=0$\n",
    "\n",
    "（2）计算$g_k=g(w^{(k)})$.若$\\|g_k\\|<\\varepsilon$，则停止计算，得$w^*=w^{(k)}$; 否则转（3）\n",
    "\n",
    "（3）由$B_kp_k=-g_k$求出$p_k$\n",
    "\n",
    "（4）一维搜索: 求$\\lambda$使得\n",
    "\n",
    "$$\n",
    "f(w^{(k)}+\\lambda_k p_k)=\\min_{\\lambda\\geq0}f(w^{(k)}+\\lambda p_k)\n",
    "$$\n",
    "\n",
    "（5）置$w^{(k+1)}=w^{(k)}+\\lambda_k p_k$\n",
    "\n",
    "（6）计算$g_{k+1}=g(w^{(k+1)})$，若$\\|g_{k+1}\\|<\\varepsilon$，则停止计算，得$w^*=w^{(k+1)}$; 否则，按下式求出$b_{k+1}$\n",
    "\n",
    "$$\n",
    "B_{k+1}=B_k+\\frac{y_ky^T_k}{y_K^T\\delta_k}-\\frac{b_k\\delta_k\\delta_k^TB_k}{\\delta_K^TB_kB_k}\n",
    "$$\n",
    "\n",
    "其中\n",
    "\n",
    "$$\n",
    "y_k=g_{k+1}-g_k,\\;\\delta_k=w^{(k+1)}-w^{(k)}\n",
    "$$\n",
    "\n",
    "（7）置$k=k+1$，转（3）\n",
    "\n",
    "## 8.3.3 梯度下降法\n",
    "\n",
    "对于Logistic模型\n",
    "\n",
    "$$\n",
    "P(Y=1 | x)=\\frac{\\exp (w \\cdot x+b)}{1+\\exp (w \\cdot x+b)} \\\\ \n",
    "P(Y=0 | x)=\\frac{1}{1+\\exp (w \\cdot x+b)}\n",
    "$$\n",
    "\n",
    "对数似然函数为：$\\displaystyle L(w)=\\sum_{i=1}^N \\left[y_i (w \\cdot x_i)-\\log \\left(1+\\exp (w \\cdot x_i)\\right)\\right]$  \n",
    "\n",
    "似然函数求偏导，可得$\\displaystyle \\frac{\\partial L(w)}{\\partial w^{(j)}}=\\sum_{i=1}^N\\left[x_i^{(j)} \\cdot y_i-\\frac{\\exp (w \\cdot x_i) \\cdot x_i^{(j)}}{1+\\exp (w \\cdot x_i)}\\right]$  \n",
    "\n",
    "梯度函数为：$\\displaystyle \\nabla L(w)=\\left[\\frac{\\partial L(w)}{\\partial w^{(0)}}, \\cdots, \\frac{\\partial L(w)}{\\partial w^{(m)}}\\right]$  \n",
    "\n",
    "---\n",
    "\n",
    "**算法8.3（Logistic回归模型学习的梯度下降算法）**\n",
    "\n",
    "输入: 目标函数$f(w)$，梯度$g(w)=\\nabla f(w)$，精度要求$\\varepsilon$\n",
    "\n",
    "输出: $f(x)$的极小点$x_*$\n",
    "\n",
    "(1) 取初始值$x^{(0)} \\in R$，置$k=0$  \n",
    "\n",
    "(2) 计算$f(x^{(k)})$  \n",
    "\n",
    "(3) 计算梯度$g_k=g(x^{(k)})$，当$\\|g_k\\| < \\varepsilon$时，停止迭代，令$x^* = x^{(k)}$；否则，求$\\lambda_k$，使得$\\displaystyle f(x^{(k)}+\\lambda_k g_k) = \\max_{\\lambda \\geqslant 0}f(x^{(k)}+\\lambda g_k)$ \n",
    "\n",
    "(4) 置$x^{(k+1)}=x^{(k)}+\\lambda_k g_k$，计算$f(x^{(k+1)})$，当$\\|f(x^{(k+1)}) - f(x^{(k)})\\| < \\varepsilon$或 $\\|x^{(k+1)} - x^{(k)}\\| < \\varepsilon$时，停止迭代，令$x^* = x^{(k+1)}$  \n",
    "\n",
    "(5) 否则，置$k=k+1$，转(3)\n",
    "\n",
    "---\n",
    "\n",
    "当目标函数是凸函数时，梯度下降法的解是全局最优解. 一般情况下，其解不保证是全局最优解. 梯度下降法的收敛速度也未必是很快的."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import time\n",
    "import matplotlib.pyplot as plt\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "from pylab import mpl\n",
    "\n",
    "# 图像显示中文\n",
    "mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']\n",
    "\n",
    "\n",
    "class LogisticRegression:\n",
    "    def __init__(self, learn_rate=0.1, max_iter=10000, tol=1e-2):\n",
    "        self.learn_rate = learn_rate  # 学习率\n",
    "        self.max_iter = max_iter  # 迭代次数\n",
    "        self.tol = tol  # 迭代停止阈值\n",
    "        self.w = None  # 权重\n",
    "\n",
    "    def preprocessing(self, X):\n",
    "        \"\"\"将原始X末尾加上一列，该列数值全部为1\"\"\"\n",
    "        row = X.shape[0]\n",
    "        y = np.ones(row).reshape(row, 1)\n",
    "        X_prepro = np.hstack((X, y))\n",
    "        return X_prepro\n",
    "\n",
    "    def sigmod(self, x):\n",
    "        return 1 / (1 + np.exp(-x))\n",
    "\n",
    "    def fit(self, X_train, y_train):\n",
    "        X = self.preprocessing(X_train)\n",
    "        y = y_train.T\n",
    "        # 初始化权重w\n",
    "        self.w = np.array([[0] * X.shape[1]], dtype=np.float)\n",
    "        k = 0\n",
    "        for loop in range(self.max_iter):\n",
    "            # 计算梯度\n",
    "            z = np.dot(X, self.w.T)\n",
    "            grad = X * (y - self.sigmod(z))\n",
    "            grad = grad.sum(axis=0)\n",
    "            # 利用梯度的绝对值作为迭代中止的条件\n",
    "            if (np.abs(grad) <= self.tol).all():\n",
    "                break\n",
    "            else:\n",
    "                # 更新权重w 梯度上升——求极大值\n",
    "                self.w += self.learn_rate * grad\n",
    "                k += 1\n",
    "        print(\"迭代次数：{}次\".format(k))\n",
    "        print(\"最终梯度：{}\".format(grad))\n",
    "        print(\"最终权重：{}\".format(self.w[0]))\n",
    "\n",
    "    def predict(self, x):\n",
    "        p = self.sigmod(np.dot(self.preprocessing(x), self.w.T))\n",
    "        print(\"Y=1的概率被估计为：{:.2%}\".format(p[0][0]))  # 调用score时，注释掉\n",
    "        p[np.where(p > 0.5)] = 1\n",
    "        p[np.where(p < 0.5)] = 0\n",
    "        return p\n",
    "\n",
    "    def score(self, X, y):\n",
    "        y_c = self.predict(X)\n",
    "        error_rate = np.sum(np.abs(y_c - y.T)) / y_c.shape[0]\n",
    "        return 1 - error_rate\n",
    "\n",
    "    def draw(self, X, y):\n",
    "        # 分离正负实例点\n",
    "        y = y[0]\n",
    "        X_po = X[np.where(y == 1)]\n",
    "        X_ne = X[np.where(y == 0)]\n",
    "        # 绘制数据集散点图\n",
    "        ax = plt.axes(projection='3d')\n",
    "        x_1 = X_po[0, :]\n",
    "        y_1 = X_po[1, :]\n",
    "        z_1 = X_po[2, :]\n",
    "        x_2 = X_ne[0, :]\n",
    "        y_2 = X_ne[1, :]\n",
    "        z_2 = X_ne[2, :]\n",
    "        ax.scatter(x_1, y_1, z_1, c=\"r\", label=\"正实例\")\n",
    "        ax.scatter(x_2, y_2, z_2, c=\"b\", label=\"负实例\")\n",
    "        ax.legend(loc='best')\n",
    "        # 绘制p=0.5的区分平面\n",
    "        x = np.linspace(-3, 3, 3)\n",
    "        y = np.linspace(-3, 3, 3)\n",
    "        x_3, y_3 = np.meshgrid(x, y)\n",
    "        a, b, c, d = self.w[0]\n",
    "        z_3 = -(a * x_3 + b * y_3 + d) / c\n",
    "        ax.plot_surface(x_3, y_3, z_3, alpha=0.5)  # 调节透明度\n",
    "        plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 训练数据集\n",
    "X_train = np.array([[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1],\n",
    "                    [2, -2, 1]])\n",
    "y_train = np.array([[1, 1, 1, 0, 0, 0]])\n",
    "# 构建实例，进行训练\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X_train, y_train)\n",
    "clf.draw(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 小结\n",
    "\n",
    "1．Logistic回归模型是由以下条件概率分布表示的分类模型。Logistic回归模型可以用于二类或多类分类。\n",
    "\n",
    "$$P(Y=k | x)=\\frac{\\exp \\left(w_{k} \\cdot x\\right)}{1+\\sum_{k=1}^{K-1} \\exp \\left(w_{k} \\cdot x\\right)}, \\quad k=1,2, \\cdots, K-1$$\n",
    "\n",
    "$$P(Y=K | x)=\\frac{1}{1+\\sum_{k=1}^{K-1} \\exp \\left(w_{k} \\cdot x\\right)}$$\n",
    "这里，$x$为输入特征，$w$为特征的权值。\n",
    "\n",
    "Logistic回归模型源自Logistic分布，其分布函数$F(x)$是$S$形函数。Logistic回归模型是由输入的线性函数表示的输出的对数几率模型。\n",
    "\n",
    "2．最大熵模型是由以下条件概率分布表示的分类模型。最大熵模型也可以用于二类或多类分类。\n",
    "\n",
    "$$P_{w}(y | x)=\\frac{1}{Z_{w}(x)} \\exp \\left(\\sum_{i=1}^{n} w_{i} f_{i}(x, y)\\right)$$\n",
    "$$Z_{w}(x)=\\sum_{y} \\exp \\left(\\sum_{i=1}^{n} w_{i} f_{i}(x, y)\\right)$$\n",
    "\n",
    "其中，$Z_w(x)$是规范化因子，$f_i$为特征函数，$w_i$为特征的权值。\n",
    "\n",
    "3．最大熵模型可以由最大熵原理推导得出。最大熵原理是概率模型学习或估计的一个准则。最大熵原理认为在所有可能的概率模型（分布）的集合中，熵最大的模型是最好的模型。\n",
    "\n",
    "最大熵原理应用到分类模型的学习中，有以下约束最优化问题：\n",
    " \n",
    "$$\n",
    "\\begin{align}\n",
    "\\min\\quad&H(P)=-\\sum_{x,y}\\tilde{P}(x)P(y|x)\\log P(y|x)\\\\\n",
    "\\operatorname{s.t.}\\quad&E_P(f_i)-E_{\\tilde{P}}(f_i)=0,\\;i=1,2,\\ldots,n\\\\\n",
    "&\\sum_yP(y|x)=1\n",
    "\\end{align}\n",
    "$$\n",
    " \n",
    "求解此最优化问题的对偶问题得到最大熵模型。\n",
    "\n",
    "4．Logistic回归模型与最大熵模型都属于对数线性模型。\n",
    "\n",
    "5．Logistic回归模型及最大熵模型学习一般采用极大似然估计，或正则化的极大似然估计。Logistic回归模型及最大熵模型学习可以形式化为无约束最优化问题。求解该最优化问题的算法有改进的迭代尺度法、梯度下降法、拟牛顿法。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
